Dataset statistics
| Number of variables | 4 |
|---|---|
| Number of observations | 1000209 |
| Missing cells | 0 |
| Missing cells (%) | 0.0% |
| Duplicate rows | 0 |
| Duplicate rows (%) | 0.0% |
| Total size in memory | 30.5 MiB |
| Average record size in memory | 32.0 B |
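The memory figures above can be cross-checked with simple arithmetic. The sketch below assumes every column is stored as 8-byte values (for example `int64`); the report does not state the dtypes, so `BYTES_PER_VALUE` is an assumption chosen because it reproduces the reported 32.0 B record size.

```python
# Figures copied from the dataset statistics table.
N_ROWS = 1_000_209
N_COLS = 4
BYTES_PER_VALUE = 8  # assumption: 8-byte values per column (e.g. int64)


def to_mib(n_bytes: int) -> float:
    """Convert a byte count to mebibytes (1 MiB = 2**20 bytes)."""
    return n_bytes / 2**20


record_bytes = N_COLS * BYTES_PER_VALUE   # bytes per row
total_bytes = N_ROWS * record_bytes       # bytes for the whole table
column_bytes = N_ROWS * BYTES_PER_VALUE   # bytes for a single column

print(f"record size: {record_bytes} B")
print(f"total size:  {to_mib(total_bytes):.1f} MiB")
print(f"per column:  {to_mib(column_bytes):.1f} MiB")
```

Under that assumption the totals round to the 30.5 MiB reported for the dataset and the 7.6 MiB reported for each variable.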
Variable types
| Numeric | 3 |
|---|---|
| Categorical | 1 |
Alerts
| UserID is highly correlated with Timestamp | High correlation |
|---|---|
| Timestamp is highly correlated with UserID | High correlation |
Reproduction
| Analysis started | 2022-07-14 12:45:10.035761 |
|---|---|
| Analysis finished | 2022-07-14 13:47:24.081674 |
| Duration | 1 hour, 2 minutes and 14.05 seconds |
| Software version | pandas-profiling v3.2.0 |
| Download configuration | config.json |
UserID
Real number (ℝ≥0)
| Distinct | 6040 |
|---|---|
| Distinct (%) | 0.6% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Mean | 3024.512348 |
| Minimum | 1 |
| Maximum | 6040 |
| Zeros | 0 |
| Zeros (%) | 0.0% |
| Negative | 0 |
| Negative (%) | 0.0% |
| Memory size | 7.6 MiB |
Quantile statistics
| Minimum | 1 |
|---|---|
| 5-th percentile | 331 |
| Q1 | 1506 |
| median | 3070 |
| Q3 | 4476 |
| 95-th percentile | 5740 |
| Maximum | 6040 |
| Range | 6039 |
| Interquartile range (IQR) | 2970 |
Descriptive statistics
| Standard deviation | 1728.412695 |
|---|---|
| Coefficient of variation (CV) | 0.5714682223 |
| Kurtosis | -1.20099506 |
| Mean | 3024.512348 |
| Median Absolute Deviation (MAD) | 1465 |
| Skewness | 0.005734559099 |
| Sum | 3025144471 |
| Variance | 2987410.444 |
| Monotonicity | Increasing |
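Several of the UserID statistics are derived from others in the same tables. A minimal sketch that recomputes them from the reported values, as a consistency check rather than a reproduction of how the tool computes them:

```python
import math

# Values copied from the UserID tables.
n = 1_000_209
minimum, maximum = 1, 6040
q1, q3 = 1506, 4476
mean = 3024.512348
std = 1728.412695

value_range = maximum - minimum  # Range
iqr = q3 - q1                    # Interquartile range
cv = std / mean                  # Coefficient of variation
variance = std ** 2              # Variance
total = mean * n                 # Sum

# Compare against the figures printed in the report.
assert value_range == 6039
assert iqr == 2970
assert math.isclose(cv, 0.5714682223, rel_tol=1e-6)
assert math.isclose(variance, 2987410.444, rel_tol=1e-6)
assert math.isclose(total, 3025144471, rel_tol=1e-6)
```

The same relations hold for the MovieID and Timestamp tables, so they can be checked by swapping in those values.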
Histogram with fixed size bins (bins=50)
Common values
| Value | Count | Frequency (%) |
|---|---|---|
| 4169 | 2314 | 0.2% |
| 1680 | 1850 | 0.2% |
| 4277 | 1743 | 0.2% |
| 1941 | 1595 | 0.2% |
| 1181 | 1521 | 0.2% |
| 889 | 1518 | 0.2% |
| 3618 | 1344 | 0.1% |
| 2063 | 1323 | 0.1% |
| 1150 | 1302 | 0.1% |
| 1015 | 1286 | 0.1% |
| Other values (6030) | 984413 | 98.4% |
Minimum 10 values
| Value | Count | Frequency (%) |
|---|---|---|
| 1 | 53 | < 0.1% |
| 2 | 129 | < 0.1% |
| 3 | 51 | < 0.1% |
| 4 | 21 | < 0.1% |
| 5 | 198 | < 0.1% |
| 6 | 71 | < 0.1% |
| 7 | 31 | < 0.1% |
| 8 | 139 | < 0.1% |
| 9 | 106 | < 0.1% |
| 10 | 401 | < 0.1% |
Maximum 10 values
| Value | Count | Frequency (%) |
|---|---|---|
| 6040 | 341 | < 0.1% |
| 6039 | 123 | < 0.1% |
| 6038 | 20 | < 0.1% |
| 6037 | 202 | < 0.1% |
| 6036 | 888 | 0.1% |
| 6035 | 280 | < 0.1% |
| 6034 | 21 | < 0.1% |
| 6033 | 60 | < 0.1% |
| 6032 | 104 | < 0.1% |
| 6031 | 51 | < 0.1% |
MovieID
Real number (ℝ≥0)
| Distinct | 3706 |
|---|---|
| Distinct (%) | 0.4% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Mean | 1865.539898 |
| Minimum | 1 |
| Maximum | 3952 |
| Zeros | 0 |
| Zeros (%) | 0.0% |
| Negative | 0 |
| Negative (%) | 0.0% |
| Memory size | 7.6 MiB |
Quantile statistics
| Minimum | 1 |
|---|---|
| 5-th percentile | 172 |
| Q1 | 1030 |
| median | 1835 |
| Q3 | 2770 |
| 95-th percentile | 3675 |
| Maximum | 3952 |
| Range | 3951 |
| Interquartile range (IQR) | 1740 |
Descriptive statistics
| Standard deviation | 1096.040689 |
|---|---|
| Coefficient of variation (CV) | 0.587519297 |
| Kurtosis | -1.111020976 |
| Mean | 1865.539898 |
| Median Absolute Deviation (MAD) | 884 |
| Skewness | 0.09243570938 |
| Sum | 1865929796 |
| Variance | 1201305.193 |
| Monotonicity | Not monotonic |
Histogram with fixed size bins (bins=50)
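The histogram image is not preserved here, but the binning can be illustrated. The sketch below assumes 50 equal-width bins spanning the MovieID minimum and maximum, with the last bin closed on the right; `bin_index` is a hypothetical helper, not a function from the profiling tool.

```python
def bin_index(value: float, lo: float, hi: float, bins: int = 50) -> int:
    """Return the 0-based index of the equal-width bin containing value."""
    if not lo <= value <= hi:
        raise ValueError("value outside histogram range")
    width = (hi - lo) / bins
    # Clamp so that value == hi falls into the last bin instead of overflowing.
    return min(int((value - lo) / width), bins - 1)


LO, HI = 1, 3952          # MovieID minimum and maximum from the report
WIDTH = (HI - LO) / 50    # width of each bin

print(f"bin width: {WIDTH:.2f}")
print("bin of MovieID 2858:", bin_index(2858, LO, HI))
```

With these bounds each bin covers about 79 consecutive IDs, so the most frequently rated movie (ID 2858) contributes to a single bin together with its neighbouring IDs.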
Common values
| Value | Count | Frequency (%) |
|---|---|---|
| 2858 | 3428 | 0.3% |
| 260 | 2991 | 0.3% |
| 1196 | 2990 | 0.3% |
| 1210 | 2883 | 0.3% |
| 480 | 2672 | 0.3% |
| 2028 | 2653 | 0.3% |
| 589 | 2649 | 0.3% |
| 2571 | 2590 | 0.3% |
| 1270 | 2583 | 0.3% |
| 593 | 2578 | 0.3% |
| Other values (3696) | 972192 | 97.2% |
Minimum 10 values
| Value | Count | Frequency (%) |
|---|---|---|
| 1 | 2077 | 0.2% |
| 2 | 701 | 0.1% |
| 3 | 478 | < 0.1% |
| 4 | 170 | < 0.1% |
| 5 | 296 | < 0.1% |
| 6 | 940 | 0.1% |
| 7 | 458 | < 0.1% |
| 8 | 68 | < 0.1% |
| 9 | 102 | < 0.1% |
| 10 | 888 | 0.1% |
Maximum 10 values
| Value | Count | Frequency (%) |
|---|---|---|
| 3952 | 388 | < 0.1% |
| 3951 | 40 | < 0.1% |
| 3950 | 54 | < 0.1% |
| 3949 | 304 | < 0.1% |
| 3948 | 862 | 0.1% |
| 3947 | 55 | < 0.1% |
| 3946 | 100 | < 0.1% |
| 3945 | 43 | < 0.1% |
| 3944 | 9 | < 0.1% |
| 3943 | 96 | < 0.1% |
Rating
Categorical
| Distinct | 5 |
|---|---|
| Distinct (%) | < 0.1% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Memory size | 7.6 MiB |
Length
| Max length | 1 |
|---|---|
| Median length | 1 |
| Mean length | 1 |
| Min length | 1 |
Characters and Unicode
| Total characters | 1000209 |
|---|---|
| Distinct characters | 5 |
| Distinct categories | 1 |
| Distinct scripts | 1 |
| Distinct blocks | 1 |
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.
Unique
| Unique | 0 |
|---|---|
| Unique (%) | 0.0% |
Sample
| 1st row | 5 |
|---|---|
| 2nd row | 3 |
| 3rd row | 3 |
| 4th row | 4 |
| 5th row | 5 |
Common Values
| Value | Count | Frequency (%) |
|---|---|---|
| 4 | 348971 | 34.9% |
| 3 | 261197 | 26.1% |
| 5 | 226310 | 22.6% |
| 2 | 107557 | 10.8% |
| 1 | 56174 | 5.6% |
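Each Frequency (%) entry is the count divided by the 1000209 observations, shown to one decimal place. The sketch below recomputes the Rating shares; `percent_label` is a hypothetical helper, and its `< 0.1%` cutoff is an assumption chosen to match the labels that appear in this report, not something taken from the tool's source.

```python
TOTAL = 1_000_209  # number of observations in the report


def percent_label(count: int, total: int = TOTAL) -> str:
    """Format count/total as a one-decimal percentage label."""
    share = count / total
    # Assumption: non-zero shares that round to 0.000 are shown as "< 0.1%".
    if share > 0 and round(share, 3) == 0:
        return "< 0.1%"
    return f"{share * 100:.1f}%"


# Counts copied from the Rating common values table.
rating_counts = {4: 348971, 3: 261197, 5: 226310, 2: 107557, 1: 56174}
labels = {rating: percent_label(count) for rating, count in rating_counts.items()}

for rating, label in labels.items():
    print(rating, rating_counts[rating], label)
```

The five counts add up to the total number of observations, which agrees with the report listing no missing values for Rating.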
Length
Histogram of lengths of the category
Category Frequency Plot
| Value | Count | Frequency (%) |
|---|---|---|
| 4 | 348971 | 34.9% |
| 3 | 261197 | 26.1% |
| 5 | 226310 | 22.6% |
| 2 | 107557 | 10.8% |
| 1 | 56174 | 5.6% |
Most occurring characters
| Value | Count | Frequency (%) |
|---|---|---|
| 4 | 348971 | 34.9% |
| 3 | 261197 | 26.1% |
| 5 | 226310 | 22.6% |
| 2 | 107557 | 10.8% |
| 1 | 56174 | 5.6% |
Most occurring categories
| Value | Count | Frequency (%) |
|---|---|---|
| Decimal Number | 1000209 | 100.0% |
Most frequent character per category
Decimal Number
| Value | Count | Frequency (%) |
|---|---|---|
| 4 | 348971 | 34.9% |
| 3 | 261197 | 26.1% |
| 5 | 226310 | 22.6% |
| 2 | 107557 | 10.8% |
| 1 | 56174 | 5.6% |
Most occurring scripts
| Value | Count | Frequency (%) |
|---|---|---|
| Common | 1000209 | 100.0% |
Most frequent character per script
Common
| Value | Count | Frequency (%) |
|---|---|---|
| 4 | 348971 | 34.9% |
| 3 | 261197 | 26.1% |
| 5 | 226310 | 22.6% |
| 2 | 107557 | 10.8% |
| 1 | 56174 | 5.6% |
Most occurring blocks
| Value | Count | Frequency (%) |
|---|---|---|
| ASCII | 1000209 | 100.0% |
Most frequent character per block
ASCII
| Value | Count | Frequency (%) |
|---|---|---|
| 4 | 348971 | 34.9% |
| 3 | 261197 | 26.1% |
| 5 | 226310 | 22.6% |
| 2 | 107557 | 10.8% |
| 1 | 56174 | 5.6% |
Timestamp
Real number (ℝ≥0)
| Distinct | 458455 |
|---|---|
| Distinct (%) | 45.8% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Mean | 972243695.4 |
| Minimum | 956703932 |
| Maximum | 1046454590 |
| Zeros | 0 |
| Zeros (%) | 0.0% |
| Negative | 0 |
| Negative (%) | 0.0% |
| Memory size | 7.6 MiB |
Quantile statistics
| Minimum | 956703932 |
|---|---|
| 5-th percentile | 958704090.8 |
| Q1 | 965302637 |
| median | 973018006 |
| Q3 | 975220939 |
| 95-th percentile | 993074152.6 |
| Maximum | 1046454590 |
| Range | 89750658 |
| Interquartile range (IQR) | 9918302 |
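The Timestamp values are consistent with Unix epoch seconds, although the report does not state the unit. Under that assumption, a short sketch converts the reported minimum and maximum to UTC datetimes and expresses the range in days; `to_utc` is a hypothetical helper.

```python
from datetime import datetime, timezone


def to_utc(epoch_seconds: int) -> datetime:
    """Interpret a value as Unix epoch seconds (assumption) and return UTC time."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)


first = to_utc(956703932)    # reported minimum
last = to_utc(1046454590)    # reported maximum
span_days = (last - first).total_seconds() / 86400

print("first rating:", first.isoformat())
print("last rating: ", last.isoformat())
print(f"span: {span_days:.1f} days")
```

Read this way, the ratings run from late April 2000 to the end of February 2003, a span of roughly 1039 days.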
Descriptive statistics
| Standard deviation | 12152558.94 |
|---|---|
| Coefficient of variation (CV) | 0.01249949884 |
| Kurtosis | 10.94997785 |
| Mean | 972243695.4 |
| Median Absolute Deviation (MAD) | 5308808 |
| Skewness | 2.765691163 |
| Sum | 9.724468943 × 10¹⁴ |
| Variance | 1.476846888 × 10¹⁴ |
| Monotonicity | Not monotonic |
Histogram with fixed size bins (bins=50)
Common values
| Value | Count | Frequency (%) |
|---|---|---|
| 975528402 | 30 | < 0.1% |
| 975440712 | 28 | < 0.1% |
| 975527781 | 28 | < 0.1% |
| 1025585635 | 27 | < 0.1% |
| 975528243 | 27 | < 0.1% |
| 975280276 | 26 | < 0.1% |
| 975528115 | 26 | < 0.1% |
| 975280390 | 25 | < 0.1% |
| 1025036288 | 25 | < 0.1% |
| 974698015 | 24 | < 0.1% |
| Other values (458445) | 999943 | > 99.9% |
Minimum 10 values
| Value | Count | Frequency (%) |
|---|---|---|
| 956703932 | 1 | < 0.1% |
| 956703954 | 2 | < 0.1% |
| 956703977 | 2 | < 0.1% |
| 956704056 | 5 | < 0.1% |
| 956704081 | 1 | < 0.1% |
| 956704191 | 3 | < 0.1% |
| 956704219 | 1 | < 0.1% |
| 956704257 | 3 | < 0.1% |
| 956704305 | 1 | < 0.1% |
| 956704448 | 1 | < 0.1% |
Maximum 10 values
| Value | Count | Frequency (%) |
|---|---|---|
| 1046454590 | 1 | < 0.1% |
| 1046454548 | 2 | < 0.1% |
| 1046454443 | 1 | < 0.1% |
| 1046454338 | 1 | < 0.1% |
| 1046454320 | 1 | < 0.1% |
| 1046454282 | 1 | < 0.1% |
| 1046454260 | 1 | < 0.1% |
| 1046444711 | 1 | < 0.1% |
| 1046437932 | 1 | < 0.1% |
| 1046437879 | 1 | < 0.1% |
Phik (φk)
Phik (φk) is a new and practical correlation coefficient that works consistently between categorical, ordinal and interval variables, captures non-linear dependency, and reverts to the Pearson correlation coefficient in the case of a bivariate normal input distribution. Extensive documentation is available from the Phik project.
A simple visualization of nullity by column.
The nullity matrix is a data-dense display that lets you quickly pick out patterns in data completion by eye.
First rows
| | UserID | MovieID | Rating | Timestamp |
|---|---|---|---|---|
| 0 | 1 | 1193 | 5 | 978300760 |
| 1 | 1 | 661 | 3 | 978302109 |
| 2 | 1 | 914 | 3 | 978301968 |
| 3 | 1 | 3408 | 4 | 978300275 |
| 4 | 1 | 2355 | 5 | 978824291 |
| 5 | 1 | 1197 | 3 | 978302268 |
| 6 | 1 | 1287 | 5 | 978302039 |
| 7 | 1 | 2804 | 5 | 978300719 |
| 8 | 1 | 594 | 4 | 978302268 |
| 9 | 1 | 919 | 4 | 978301368 |
Last rows
| | UserID | MovieID | Rating | Timestamp |
|---|---|---|---|---|
| 1000199 | 6040 | 2022 | 5 | 956716207 |
| 1000200 | 6040 | 2028 | 5 | 956704519 |
| 1000201 | 6040 | 1080 | 4 | 957717322 |
| 1000202 | 6040 | 1089 | 4 | 956704996 |
| 1000203 | 6040 | 1090 | 3 | 956715518 |
| 1000204 | 6040 | 1091 | 1 | 956716541 |
| 1000205 | 6040 | 1094 | 5 | 956704887 |
| 1000206 | 6040 | 562 | 5 | 956704746 |
| 1000207 | 6040 | 1096 | 4 | 956715648 |
| 1000208 | 6040 | 1097 | 4 | 956715569 |